Machine learning is highly relevant to the real estate industry, as it allows companies to set selling prices or rents efficiently. Real estate agents only need to visit a property and gather the necessary information (e.g., square meters, number of rooms, bathrooms), which is then translated into data. A machine learning model then provides rapid and efficient estimates. Rather than relying solely on the agent's judgment, machine learning brings a more scientific, data-driven approach that provides a benchmark the agent can adjust based on his/her experience and knowledge.
In this project, several supervised and unsupervised machine learning models are used to predict house and apartment rents in Brazil. What makes this project original is not the prediction task itself but the country where the data was collected: data sets of this kind commonly come from the United States and rarely from Latin America.
This project aims to analyze the observations in order to better understand the data, by investigating the variables in the data set and processing the different variable types (e.g., dummy, factor, numeric, categorical). Furthermore, we explore a wide range of models, discuss and compare them, and finally select the best model for this data set.
We found our dataset on Kaggle, a platform hosting an online community of data scientists. There were two versions of the dataset; we picked the second one, as it was the most complete and recent. This dataset contains 10’692 properties to rent:
| Dataset | Source | Type | Observations | Features | Description |
|---|---|---|---|---|---|
| Brazilian houses to rent v2 | https://www.kaggle.com | CSV | 10’692 | 13 | city, area [m2], rooms, bathroom, parking spaces, floor, animals, furniture, in BRL (R$): HOA, rent amount, property tax, fire insurance, total |
For the moment, we are keeping total, but we will remove it when we fit the models, as the value we want to predict, rent, is contained in total.
| No | Variable | Stats / Values | Freqs / Distinct | Missing |
|---|---|---|---|---|
| 1 | city [factor] | 1. Belo Horizonte, 2. Campinas, 3. Porto Alegre, 4. Rio de Janeiro, 5. São Paulo | 5 levels | 0 (0.0%) |
| 2 | area [integer] | Mean (sd): 149 (537); min < med < max: 11 < 90 < 46335; IQR (CV): 126 (3.6) | 517 distinct values | 0 (0.0%) |
| 3 | rooms [integer] | Mean (sd): 2.5 (1.2); min < med < max: 1 < 2 < 13; IQR (CV): 1 (0.5) | 11 distinct values | 0 (0.0%) |
| 4 | bathroom [integer] | Mean (sd): 2.2 (1.4); min < med < max: 1 < 2 < 10; IQR (CV): 2 (0.6) | | 0 (0.0%) |
| 5 | parking.spaces [integer] | Mean (sd): 1.6 (1.6); min < med < max: 0 < 1 < 12; IQR (CV): 2 (1) | 11 distinct values | 0 (0.0%) |
| 6 | floor [factor] | 1. -, 2. 1, 3. 10, 4. 11, 5. 12, 6. 13, 7. 14, 8. 15, 9. 16, 10. 17, [25 others] | 35 levels | 0 (0.0%) |
| 7 | animal [factor] | 1. acept, 2. not acept | 2 levels | 0 (0.0%) |
| 8 | furniture [factor] | 1. furnished, 2. not furnished | 2 levels | 0 (0.0%) |
| 9 | hoa..R.. [integer] | Mean (sd): 1174 (15592); min < med < max: 0 < 560 < 1117000; IQR (CV): 1068 (13.3) | 1679 distinct values | 0 (0.0%) |
| 10 | rent.amount..R.. [integer] | Mean (sd): 3896 (3408); min < med < max: 450 < 2661 < 45000; IQR (CV): 3470 (0.9) | 1195 distinct values | 0 (0.0%) |
| 11 | property.tax..R.. [integer] | Mean (sd): 367 (3108); min < med < max: 0 < 125 < 313700; IQR (CV): 337 (8.5) | 1243 distinct values | 0 (0.0%) |
| 12 | fire.insurance..R.. [integer] | Mean (sd): 53.3 (47.8); min < med < max: 3 < 36 < 677; IQR (CV): 47 (0.9) | 216 distinct values | 0 (0.0%) |
| 13 | total..R.. [integer] | Mean (sd): 5490 (16485); min < med < max: 499 < 3582 < 1120000; IQR (CV): 4706 (3) | 5751 distinct values | 0 (0.0%) |
Looking at the summary and head of our dataset, we observed that some variables need to be modified. We renamed some of them so that their names display correctly. According to the author of the dataset, the value “-” in the floor feature corresponds to houses at ground level. Therefore, we changed it to 0.
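These first cleaning steps can be sketched in Python with pandas (the project's own output above comes from R; the sample rows and simplified column names below are invented for illustration):

```python
import pandas as pd

# Invented sample rows; the real data comes from the Kaggle CSV described above.
df = pd.DataFrame({
    "hoa..R..": [2065, 0],
    "rent.amount..R..": [3300, 800],
    "floor": ["7", "-"],
})

# Give the mangled BRL column names readable labels.
df = df.rename(columns={"hoa..R..": "hoa", "rent.amount..R..": "rent"})

# Per the dataset author, "-" marks ground-level houses: recode it as floor 0.
df["floor"] = df["floor"].replace("-", "0").astype(int)

print(df["floor"].tolist())  # [7, 0]
```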
| city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax | fire insurance | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sao Paulo | 70 | 2 | 1 | 1 | 7 | yes | yes | 2065 | 3300 | 211 | 42 | 5618 |
| Sao Paulo | 320 | 4 | 4 | 0 | 20 | yes | no | 1200 | 4960 | 1750 | 63 | 7973 |
| Porto Alegre | 80 | 1 | 1 | 1 | 6 | yes | no | 1000 | 2800 | 0 | 41 | 3841 |
| Porto Alegre | 51 | 2 | 1 | 0 | 2 | yes | no | 270 | 1112 | 22 | 17 | 1421 |
| Sao Paulo | 25 | 1 | 1 | 0 | 1 | no | no | 0 | 800 | 25 | 11 | 836 |
| Sao Paulo | 376 | 3 | 3 | 7 | 0 | yes | no | 0 | 8000 | 834 | 121 | 8955 |
As seen in the last column of the summary, we don’t have any missing values.
We have several instances that are duplicates, of which we only need to keep the first occurrence. By removing them, we go from 10’692 to 10’334 observations.
| city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax | fire insurance | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Belo Horizonte | 15 | 1 | 1 | 0 | 3 | no | yes | 0 | 1100 | 0 | 15 | 1115 |
| Belo Horizonte | 15 | 1 | 1 | 0 | 3 | no | yes | 0 | 1100 | 0 | 15 | 1115 |
| Belo Horizonte | 15 | 1 | 1 | 0 | 3 | no | yes | 0 | 1100 | 0 | 15 | 1115 |
| Belo Horizonte | 20 | 1 | 1 | 0 | 6 | no | no | 420 | 1050 | 79 | 14 | 1563 |
| Belo Horizonte | 20 | 1 | 1 | 0 | 6 | no | no | 420 | 1050 | 79 | 14 | 1563 |
| Belo Horizonte | 20 | 1 | 1 | 1 | 0 | yes | yes | 0 | 1100 | 0 | 15 | 1115 |
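The de-duplication step can be sketched with pandas, using toy rows that mirror the duplicated Belo Horizonte listings above:

```python
import pandas as pd

# Toy frame with three identical listings and one distinct one.
df = pd.DataFrame({
    "city": ["Belo Horizonte"] * 4,
    "area": [15, 15, 15, 20],
    "rent": [1100, 1100, 1100, 1050],
})

# keep="first" retains only the first occurrence of each duplicated row.
df = df.drop_duplicates(keep="first").reset_index(drop=True)

print(len(df))  # 2
```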
Looking at some of the features in the summary, we see a high possibility of outliers in area, hoa and property_tax. We can quickly check the distribution of all the features with boxplots.
We should apply a log transformation to some of the features to better see their distributions. But before applying the transformation, we should look into some visible outliers. As we had guessed, area, hoa and property_tax have some extreme values; floor and rent also seem to have outliers. By looking at the top 5 values of each of these features, we can find some discrepancies:
| city | area | floor | rent | hoa | property tax |
|---|---|---|---|---|---|
| Belo Horizonte | 46335 | 11 | 8500 | 960 | 646 |
| Sao Paulo | 24606 | 12 | 8100 | 2254 | 7859 |
| Campinas | 12732 | 3 | 1600 | 700 | 96 |
| Belo Horizonte | 2000 | 0 | 4956 | 0 | 200 |
| Belo Horizonte | 2000 | 2 | 5000 | 3000 | 200 |
| city | area | floor | rent | hoa | property tax |
|---|---|---|---|---|---|
| Belo Horizonte | 155 | 4 | 2790 | 1117000 | 64 |
| Sao Paulo | 340 | 7 | 12000 | 220000 | 1000 |
| Sao Paulo | 285 | 6 | 20000 | 200000 | 1834 |
| Rio de Janeiro | 35 | 1 | 4500 | 81150 | 9900 |
| Porto Alegre | 42 | 10 | 700 | 32000 | 40 |
| city | area | floor | rent | hoa | property tax |
|---|---|---|---|---|---|
| Sao Paulo | 42 | 6 | 2500 | 690 | 313700 |
| Rio de Janeiro | 95 | 11 | 3000 | 1024 | 28120 |
| Sao Paulo | 700 | 0 | 10000 | 0 | 21880 |
| Belo Horizonte | 260 | 10 | 11900 | 3200 | 12500 |
| Sao Paulo | 300 | 6 | 1990 | 3500 | 10830 |
| city | area | floor | rent | hoa | property tax |
|---|---|---|---|---|---|
| Sao Paulo | 700 | 0 | 45000 | 0 | 8750 |
| Sao Paulo | 350 | 0 | 30000 | 0 | 560 |
| Sao Paulo | 486 | 0 | 25000 | 0 | 2200 |
| Sao Paulo | 80 | 1 | 24000 | 875 | 0 |
| Sao Paulo | 660 | 12 | 20000 | 4800 | 1750 |
Area: We observe 3 potential outliers. It is improbable to have a 46’335 m2 property on the 11th floor, or 24’606 m2 and 12’732 m2 houses on the 12th and 3rd floors. We will remove these values from our data set.
HOA: The first five values are probably outliers, as the HOA is far too high compared to the rent. Therefore, we will remove these five observations. We also checked the values beyond the first five but did not find anything abnormal.
Property Tax: It seems unlikely that a 42 m2 property pays 313’700 BRL of property tax, which is why we will remove this observation.
Rent: The 3 highest rents seem very high, but knowing that they are houses (floor = 0), they remain plausible, so we will keep these observations. However, the 4th highest rent of our data set looks suspicious: 24’000 BRL for 80 m2 on the 1st floor, knowing that the average monthly rent is about 5’000 BRL in the city centre 1. This is why we will remove this observation.
| city | floor |
|---|---|
| Belo Horizonte | 301 |
| Campinas | 51 |
| Rio de Janeiro | 25 |
| Porto Alegre | 23 |
| Sao Paulo | 46 |
Floor: By checking the tallest buildings in each city 2 and comparing them with the highest floors in our data set, we see that we need to remove two outliers, as their floors are higher than those buildings actually have.
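The outlier removal amounts to boolean-mask filtering. A minimal sketch with pandas, using invented rows that mirror the three implausible areas and the suspicious 24’000 BRL flat discussed above:

```python
import pandas as pd

# Toy rows mirroring the cases discussed in the text.
df = pd.DataFrame({
    "area": [46335, 24606, 12732, 70, 80],
    "floor": [11, 12, 3, 7, 1],
    "rent": [8500, 8100, 1600, 3300, 24000],
})

# Drop the three implausible areas and the suspicious 24'000 BRL / 80 m2 flat.
mask = (df["area"] < 12732) & ~((df["rent"] == 24000) & (df["area"] == 80))
df = df[mask].reset_index(drop=True)

print(len(df))  # 1
```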
After removing these outliers, we have 10’323 observations left. We can now log-transform some features in our dataset.
We observe for hoa that the distribution in São Paulo and Belo Horizonte is very spread out. This is because some properties have no homeowners association tax.
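Since hoa contains zeros, a plain logarithm would produce minus infinity. One common workaround, sketched below with NumPy, is the log(1 + x) transform; whether the project handled zeros exactly this way is an assumption:

```python
import numpy as np

# hoa contains zeros (no homeowners association), so log(x) would give -inf;
# log1p(x) = log(1 + x) keeps zero-valued fees finite.
hoa = np.array([0.0, 560.0, 1068.0, 15000.0])
hoa_log = np.log1p(hoa)

print(hoa_log[0])  # 0.0
```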
For the floor feature, we have decided to modify it into intervals:
| Floor | Count | Floor | Count |
|---|---|---|---|
| 0 | 2371 | 6 | 528 |
| 1 | 998 | 7 | 487 |
| 2 | 952 | 8 | 486 |
| 3 | 894 | 9-12 | 1260 |
| 4 | 726 | 13-18 | 783 |
| 5 | 580 | 19+ | 257 |
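The binning above can be sketched with pandas' cut, with bin edges inferred from the table:

```python
import pandas as pd

# Single floors up to 8, then the grouped intervals 9-12, 13-18 and 19+.
floors = pd.Series([0, 3, 10, 15, 22])
bins = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 18, 100]
labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9-12", "13-18", "19+"]
floor_cat = pd.cut(floors, bins=bins, labels=labels)

print(floor_cat.tolist())  # ['0', '3', '9-12', '13-18', '19+']
```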
We see a very high correlation of 0.99 between fire_insurance and the variable we would like to predict, rent. We will need to remove this feature, as fire_insurance is probably calculated from the rent. As expected, there is also a high correlation with total. As previously said, we will remove this variable when fitting the models, but we keep it for our EDA.
In our dataset, we have more properties accepting animals and more properties that are not furnished. We also have many more observations in São Paulo than in the other Brazilian cities, which might be explained by the fact that São Paulo is the most populated city in Brazil.
We also observe that properties accepting animals have a higher rent. The same holds for the furniture feature: if a property is already furnished, the rent is higher. We can also see that rent is highest in São Paulo.
We observe that houses (floor = 0 often corresponds to this situation) have higher rents. Rent also increases on higher floors and, as we could have expected, with the number of parking spaces.
Globally, a change in the area seems to have a positive impact on the rent price. Again, we see that we have more data in São Paulo. Plotting rent against hoa and property_tax gives plots similar to the one below, so we are not displaying them.
It is interesting to see that São Paulo and Belo Horizonte have the highest rents. However, Rio de Janeiro's total fees are higher than Belo Horizonte's: indeed, the homeowners association tax in Rio is twice as high as in Belo Horizonte. The city with the highest rent in proportion to the total fees is Belo Horizonte: on average, rent represents about 80.7% of the total fees.
For our EDA, we kept the total variable, but to predict the rent we need to remove it to avoid data leakage (rent is included in total). As we saw in the correlation part, it is not necessary to keep fire_insurance either, as it is highly correlated with the value we would like to predict; indeed, it is computed from the rent variable. Therefore, we will remove it.
In the modelling part of our project, we will start by applying supervised learning methods and comparing them. Lastly, two unsupervised learning methods will be implemented and discussed.
The supervised learning methods chosen are the following:
- Linear regression
- Regression tree
- K-Nearest Neighbors (KNN)
We first split our data at random into a training set (80%) and a test set (20%) to be able to assess the performance of our models. In addition, we created a cross-validation scheme (k-fold = 10) that will be used to train the different models.
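The split and resampling scheme can be sketched with scikit-learn on toy data (the project itself used R's caret):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data standing in for the rental features and the rent target.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# 80/20 random split for final evaluation, 10-fold CV for training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)

print(len(X_tr), len(X_te))  # 80 20
```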
We start by fitting a full linear regression model with all our variables except fire_insurance and total. After that, we proceed to a forward variable selection from the null linear model, based on the AIC criterion.
Full model
\(lm(formula = rent \sim city + area + rooms + bathroom + parking + floor + animal + furniture + hoa + property\_tax, data = df.tr)\)
Adjusted R-squared is quite low with a value of 0.585.
| Full model | Model Selection | |||||||
|---|---|---|---|---|---|---|---|---|
| Coefficients | Estimates | SE | t-value | p | Estimates | SE | t-value | p |
| (Intercept) | 282.07 | 128.55 | 2.19 | 0.028 | 254.50 | 124.40 | 2.05 | 0.041 |
| city [Campinas] | -525.20 | 112.48 | -4.67 | <0.001 | -530.59 | 112.30 | -4.72 | <0.001 |
| city [Porto Alegre] | -119.69 | 103.55 | -1.16 | 0.248 | -128.62 | 103.01 | -1.25 | 0.212 |
| city [Rio de Janeiro] | 479.84 | 101.15 | 4.74 | <0.001 | 474.06 | 100.92 | 4.70 | <0.001 |
| city [Sao Paulo] | 574.43 | 82.37 | 6.97 | <0.001 | 571.43 | 82.30 | 6.94 | <0.001 |
| area | 7.46 | 0.33 | 22.51 | <0.001 | 7.46 | 0.33 | 22.50 | <0.001 |
| rooms | 110.49 | 33.35 | 3.31 | 0.001 | 107.80 | 33.20 | 3.25 | 0.001 |
| bathroom | 516.86 | 31.91 | 16.20 | <0.001 | 517.08 | 31.91 | 16.20 | <0.001 |
| parking | 175.46 | 23.04 | 7.61 | <0.001 | 174.30 | 23.00 | 7.58 | <0.001 |
| floor [1] | -559.30 | 104.85 | -5.33 | <0.001 | -559.44 | 104.85 | -5.34 | <0.001 |
| floor [13-18] | 140.08 | 115.17 | 1.22 | 0.224 | 138.80 | 115.16 | 1.21 | 0.228 |
| floor [19+] | 713.70 | 169.47 | 4.21 | <0.001 | 713.83 | 169.47 | 4.21 | <0.001 |
| floor [2] | -599.68 | 108.09 | -5.55 | <0.001 | -602.74 | 108.03 | -5.58 | <0.001 |
| floor [3] | -509.21 | 109.50 | -4.65 | <0.001 | -512.67 | 109.42 | -4.69 | <0.001 |
| floor [4] | -460.51 | 116.44 | -3.96 | <0.001 | -463.42 | 116.38 | -3.98 | <0.001 |
| floor [5] | -408.85 | 126.06 | -3.24 | 0.001 | -411.79 | 126.01 | -3.27 | 0.001 |
| floor [6] | -302.19 | 130.05 | -2.32 | 0.020 | -303.51 | 130.04 | -2.33 | 0.020 |
| floor [7] | -430.32 | 136.49 | -3.15 | 0.002 | -433.72 | 136.43 | -3.18 | 0.001 |
| floor [8] | -465.20 | 134.32 | -3.46 | 0.001 | -467.68 | 134.28 | -3.48 | <0.001 |
| floor [9-12] | -242.26 | 101.93 | -2.38 | 0.017 | -245.32 | 101.87 | -2.41 | 0.016 |
| animal [yes] | -51.22 | 60.19 | -0.85 | 0.395 | ||||
| furniture [yes] | 1120.39 | 58.05 | 19.30 | <0.001 | 1124.49 | 57.85 | 19.44 | <0.001 |
| hoa | 0.43 | 0.03 | 14.33 | <0.001 | 0.44 | 0.03 | 14.37 | <0.001 |
| property_tax | 0.47 | 0.04 | 10.78 | <0.001 | 0.47 | 0.04 | 10.78 | <0.001 |
| Observations | 8260 | 8260 | ||||||
| R2 / R2 adjusted | 0.586 / 0.585 | 0.586 / 0.585 | ||||||
We end up with a model without animal. The adjusted R-squared remains low at 58.5%. This means that all variables except animal are retained by the AIC-based selection.
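The AIC-based forward selection can be sketched from scratch with NumPy on toy data; in R, step() performs the same greedy loop. The Gaussian AIC below is computed up to an additive constant:

```python
import numpy as np

# Toy data: only columns 0 and 1 actually drive the response.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)

def aic(cols):
    """Gaussian AIC (up to a constant) of an OLS fit on the chosen columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + 2 * A.shape[1]

# Greedy forward selection: add the predictor that lowers the AIC the most.
selected, remaining = [], list(range(X.shape[1]))
best = aic(selected)
while remaining:
    scores = {c: aic(selected + [c]) for c in remaining}
    c_best = min(scores, key=scores.get)
    if scores[c_best] >= best:  # stop when no candidate lowers the AIC
        break
    best = scores[c_best]
    selected.append(c_best)
    remaining.remove(c_best)

print(sorted(selected))
```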
First, we can see that rent is significantly lower in Campinas and Porto Alegre. Rents are cheaper on average, especially in Campinas (-525.20 BRL, with Belo Horizonte as the reference level), everything else being equal. Rents in Rio de Janeiro and São Paulo are on average significantly higher, São Paulo having the highest (+574.43 BRL), again everything else being equal.
We can see that, logically, area, rooms, parking and especially bathroom are positively associated with an increase in rent, everything else being equal. This is not something we can tell from the linear regression itself, but we were able to see in our EDA that these explanatory variables are positively correlated with one another.
On average, rent decreases with the floor. From the first floor to the twelfth, with floor 0 as the reference level, rent is associated with a lower value. Nevertheless, from the thirteenth to the eighteenth floor, rent starts to be positively impacted, and the impact is stronger from the nineteenth floor up (floor 0 still being the reference level). It is surprising that floor zero is on average much more expensive than the first twelve floors, everything else being equal, since rents tend to increase with the floor. As mentioned earlier, floor zero can correspond to houses, which have higher rents than most flats. A furnished residence has a considerably higher rent (+1’120.39 BRL) than an unfurnished one.
We trained our model with cross-validation in order to limit overfitting. As the following table shows, the R-squared oscillates between 48.3% and 64.2% across the 10 folds.
#> RMSE Rsquared MAE Resample
#> 1 2259 0.573 1409 Fold01
#> 2 2403 0.566 1444 Fold02
#> 3 2093 0.598 1368 Fold03
#> 4 2018 0.642 1330 Fold04
#> 5 2241 0.549 1435 Fold05
#> 6 2205 0.592 1405 Fold06
#> 7 2520 0.483 1571 Fold07
#> 8 1993 0.628 1325 Fold08
#> 9 2193 0.617 1480 Fold09
#> 10 2081 0.569 1320 Fold10
Next, we plot our predicted values against the observations.
We can observe that our linear model makes smaller errors when predicting rents up to roughly 5’000 BRL than above this limit. The graph shows that beyond 5’000 BRL the observations become even more spread out and difficult to predict.
We start by plotting an unpruned regression tree. We can observe that the variables used to predict rent are area, bathroom, property_tax and hoa.
We can also see that the model does not have great accuracy, especially when the observed rents are higher.
#> CART
#>
#> 8260 samples
#> 10 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 7434, 7434, 7434, 7435, 7434, 7435, ...
#> Resampling results across tuning parameters:
#>
#> cp RMSE Rsquared MAE
#> 0.0494 2458 0.478 1700
#> 0.0656 2561 0.433 1793
#> 0.3950 3035 0.380 2209
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was cp = 0.0494.
The caret train function, used with cross-validation, selected a pruned model with a best-tuned cp of 0.049.
We therefore plot the pruned tree based on this optimal cp. The chosen tree now has only 3 terminal nodes.
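Cost-complexity pruning can be sketched with scikit-learn, where ccp_alpha plays the role of rpart's cp; note the scales differ (cp is relative to the root error, ccp_alpha is absolute). Toy data for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one strong step in the signal plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(scale=0.5, size=300)

# An unpruned tree overfits the noise; pruning keeps only the strong split.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=1.0).fit(X, y)

print(pruned.get_n_leaves() < deep.get_n_leaves())  # True
```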
#> Call:
#> rpart(formula = rent ~ ., data = df.tr)
#> n= 8260
#>
#> CP nsplit rel error xerror xstd
#> 1 0.3950 0 1.000 1.000 0.0292
#> 2 0.0656 1 0.605 0.609 0.0212
#> 3 0.0494 2 0.539 0.554 0.0209
#>
#> Variable importance
#> area property_tax bathroom parking rooms
#> 27 17 17 15 12
#> hoa city
#> 10 1
#>
#> Node number 1: 8260 observations, complexity param=0.395
#> mean=3.95e+03, MSE=1.16e+07
#> left son=2 (5666 obs) right son=3 (2594 obs)
#> Primary splits:
#> area < 152 to the left, improve=0.395, (0 missing)
#> property_tax < 364 to the left, improve=0.379, (0 missing)
#> bathroom < 2.5 to the left, improve=0.359, (0 missing)
#> parking < 2.5 to the left, improve=0.282, (0 missing)
#> rooms < 3.5 to the left, improve=0.231, (0 missing)
#> Surrogate splits:
#> bathroom < 2.5 to the left, agree=0.867, adj=0.575, (0 split)
#> parking < 2.5 to the left, agree=0.850, adj=0.524, (0 split)
#> property_tax < 404 to the left, agree=0.840, adj=0.490, (0 split)
#> rooms < 3.5 to the left, agree=0.830, adj=0.460, (0 split)
#> hoa < 1690 to the left, agree=0.789, adj=0.327, (0 split)
#>
#> Node number 2: 5666 observations
#> mean=2.5e+03, MSE=3.44e+06
#>
#> Node number 3: 2594 observations, complexity param=0.0656
#> mean=7.11e+03, MSE=1.48e+07
#> left son=6 (1385 obs) right son=7 (1209 obs)
#> Primary splits:
#> property_tax < 608 to the left, improve=0.1630, (0 missing)
#> area < 300 to the left, improve=0.1270, (0 missing)
#> bathroom < 3.5 to the left, improve=0.0899, (0 missing)
#> hoa < 2790 to the left, improve=0.0535, (0 missing)
#> parking < 2.5 to the left, improve=0.0508, (0 missing)
#> Surrogate splits:
#> hoa < 2040 to the left, agree=0.706, adj=0.369, (0 split)
#> bathroom < 3.5 to the left, agree=0.667, adj=0.285, (0 split)
#> parking < 2.5 to the left, agree=0.650, adj=0.248, (0 split)
#> area < 300 to the left, agree=0.625, adj=0.194, (0 split)
#> city splits as LLLLR, agree=0.623, adj=0.190, (0 split)
#>
#> Node number 6: 1385 observations
#> mean=5.65e+03, MSE=9.71e+06
#>
#> Node number 7: 1209 observations
#> mean=8.77e+03, MSE=1.54e+07
In this summary we see that, for node 1, if area is not available the model will use bathroom to predict the rent; for node 3, if property_tax is not available, it will use hoa.
| Variable | Overall |
|---|---|
| property_tax | 0.543 |
| area | 0.522 |
| bathroom | 0.449 |
| parking | 0.333 |
| rooms | 0.231 |
| hoa | 0.054 |
| cityCampinas | 0.000 |
| cityPorto Alegre | 0.000 |
| cityRio de Janeiro | 0.000 |
| citySao Paulo | 0.000 |
| floor1 | 0.000 |
| floor13-18 | 0.000 |
| floor19+ | 0.000 |
| floor2 | 0.000 |
| floor3 | 0.000 |
| floor4 | 0.000 |
| floor5 | 0.000 |
| floor6 | 0.000 |
| floor7 | 0.000 |
| floor8 | 0.000 |
| floor9-12 | 0.000 |
| animalyes | 0.000 |
| furnitureyes | 0.000 |
property_tax and area are the two most important variables, which is why, with a tree of only three terminal nodes, the model uses these features. bathroom is the third most important, but it is rather used as a surrogate split, probably because it is highly correlated with area (0.75).
When we look at the predictions versus the observed values, the plot looks almost like the unpruned one, meaning that pruning simplifies the model without affecting the errors much.
The third and last supervised learning model we fit is K-Nearest Neighbors (KNN). We let the training procedure choose the optimal parameter k (the number of neighbors whose values are averaged to form a prediction).
#> k-Nearest Neighbors
#>
#> 8260 samples
#> 10 predictor
#>
#> Pre-processing: centered (23), scaled (23)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 7434, 7434, 7434, 7435, 7434, 7435, ...
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 5 2307 0.546 1445
#> 7 2265 0.561 1423
#> 9 2242 0.569 1411
#> 11 2242 0.570 1410
#> 13 2243 0.570 1411
#> 15 2240 0.572 1409
#> 17 2241 0.572 1411
#> 19 2242 0.573 1411
#> 21 2237 0.575 1409
#> 23 2238 0.576 1410
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 21.
The model selected k = 21, based on the RMSE. A high value of k provides a robust model; however, the goodness of fit is not excellent: the R-squared stays below 0.6.
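The k search can be sketched with scikit-learn on toy data: centre and scale the features, then pick k by 10-fold cross-validated RMSE, mirroring caret's preprocessing:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data standing in for the rental features.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=200)

# Scale inside the pipeline so each CV fold is standardised on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
grid = GridSearchCV(
    pipe,
    {"kneighborsregressor__n_neighbors": [5, 9, 15, 21]},
    cv=10,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

print(grid.best_params_["kneighborsregressor__n_neighbors"])
```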
Finally, we plot the predictions of this model:
We can observe that the prediction is better for smaller rent values.
| Models | R2 | RMSE | MAE |
|---|---|---|---|
| KNN | 0.596 | 2271 | 1436 |
| Linear Regression | 0.595 | 2244 | 1435 |
| Regression Tree | 0.499 | 2500 | 1773 |
Comparing the scores of our three models, we see that the linear regression and KNN are equivalent in terms of R2, but the linear regression has a lower RMSE than the other two models. The regression tree is clearly the worst. However, even the best model does not perform very well on our data set: as we saw previously, its adjusted R2 is 58.5%, which is not great. It is important to highlight that this adjusted R2 differs from the R2 used in the model comparison, which is computed on the test set.
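The three scores in the comparison table can be computed as below (predictions invented for illustration; the adjusted R2 mentioned above additionally penalizes the number of predictors):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented observed/predicted pairs, echoing the worked examples later on.
y_obs = np.array([1390.0, 45000.0, 3300.0])
y_hat = np.array([2159.0, 15577.0, 3000.0])

r2 = r2_score(y_obs, y_hat)
rmse = mean_squared_error(y_obs, y_hat) ** 0.5  # RMSE = sqrt(MSE)
mae = mean_absolute_error(y_obs, y_hat)

print(round(mae))  # 10164
```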
To illustrate our three models, we decided to apply them on two observations: the highest rent value and one value close to the median.
| city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2156 | Sao Paulo | 700 | 4 | 7 | 8 | 0 | yes | no | 0 | 45000 | 8750 |
| city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5170 | Porto Alegre | 62 | 2 | 2 | 2 | 6 | yes | no | 570 | 1390 | 60 |
Unfortunately, as we saw on the linear regression graph, our model makes considerable errors when predicting rents above ~5’000 BRL. The model predicts a rent of ~15’577 BRL instead of 45’000 BRL, about three times lower than it should be.
For the median observation, the linear regression is closer to the observed value of 1’390 BRL, but it still makes a large error and predicts a rent of ~2’159 BRL, an overestimate of about 770 BRL.
| Predicted | Observed | |
|---|---|---|
| Highest Rent | 15’577 | 45’000 |
| Median Rent | 2’159 | 1’390 |
For the first instance, as the area of the property is 700 m2 and the property_tax is 8’750 BRL, the regression tree indicates going right twice. The predicted rent is 8’771 BRL, while the true rent is 45’000 BRL.
For the second instance, it predicts 2’501 BRL, because we go left at the first node: the area of this property is below the first split threshold (~152 m2). The prediction error is still large, as the observed rent is 1’390 BRL, almost half the prediction.
We can see that this model already makes large errors on a typical observation of the data set and does far worse on the largest one. This explains why the regression tree was the worst model in the previous comparison.
With KNN, the prediction for the maximum observation is far from reality: we predict 13’690 BRL instead of 45’000 BRL, around three times below the real value.
KNN is more accurate with the median observation; it predicts 1’945 BRL while the actual value is 1’390 BRL.
| KNN | Predicted | Observed |
|---|---|---|
| Highest Rent | 13’690 | 45’000 |
| Median Rent | 1’945 | 1’390 |
In this part, we will apply two unsupervised learning methods:
- The PAM method for Clustering
- The PCA method for Dimension reduction
Before clustering, we scaled the numeric variables: area, hoa, rent and property tax can take very high values, while variables like the number of rooms or bathrooms have maximums of 13 and 10.
| city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax |
|---|---|---|---|---|---|---|---|---|---|---|
| Sao Paulo | -0.550 | -0.462 | -0.897 | -0.403 | 0.364 | yes | yes | 0.995 | -0.193 | -0.195 |
| Sao Paulo | 1.297 | 1.250 | 1.225 | -1.029 | 2.814 | yes | no | 0.235 | 0.291 | 2.042 |
| Porto Alegre | -0.476 | -1.319 | -0.897 | -0.403 | 0.176 | yes | no | 0.060 | -0.339 | -0.501 |
To determine the number of clusters, we used the fviz_nbclust function. As this function only takes a numeric matrix or a data frame as input, we turned the categorical variables into dummies. Since we had three categorical columns (one with five levels, the other two with two levels each), we went from eleven variables to fourteen after deleting the original columns.
| area | rooms | bathroom | parking | floor | hoa | rent | property tax | city Campinas | city Porto Alegre | city Rio de Janeiro | city Sao Paulo | animal yes | furniture yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.550 | -0.462 | -0.897 | -0.403 | 0.364 | 0.995 | -0.193 | -0.195 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1.297 | 1.250 | 1.225 | -1.029 | 2.814 | 0.235 | 0.291 | 2.042 | 0 | 0 | 0 | 1 | 1 | 0 |
| -0.476 | -1.319 | -0.897 | -0.403 | 0.176 | 0.060 | -0.339 | -0.501 | 0 | 1 | 0 | 0 | 1 | 0 |
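The dummy-and-scale step can be sketched with pandas; drop_first mirrors keeping a reference level, which matches the fourteen-column table above (toy rows for illustration):

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns.
df = pd.DataFrame({
    "area": [70.0, 320.0, 80.0],
    "city": ["Sao Paulo", "Sao Paulo", "Porto Alegre"],
    "furniture": ["yes", "no", "no"],
})

# One 0/1 column per non-reference level of each categorical variable.
dummies = pd.get_dummies(df, columns=["city", "furniture"], drop_first=True, dtype=int)

# Centre and scale the numeric column so area does not dominate the distances.
dummies["area"] = (dummies["area"] - dummies["area"].mean()) / dummies["area"].std()

print(sorted(dummies.columns))
```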
Then, we chose the silhouette statistic in order to maximize the goodness of fit of the clustering. As displayed in the graph below, k = 2 is the number of clusters that maximizes the average silhouette. We will not display the results here, but we also checked the within-cluster sum of squares method [WSS], and the elbow minimizing it was also at k = 2.
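The silhouette sweep can be sketched with scikit-learn on toy data; since PAM itself lives in R's cluster package, k-means stands in for the clustering step here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs, so the silhouette should peak at k = 2.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2
```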
So, we used the pam function specifying k=2, following our previous results. Here are the two medoids:
| cluster | id | city | area | rooms | bathroom | parking | floor | animal | furniture | hoa | rent | property tax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8797 | Rio de Janeiro | -0.439 | -0.462 | -0.897 | -0.403 | -0.201 | yes | no | -0.423 | -0.660 | -0.356 |
| 2 | 4749 | Sao Paulo | 0.411 | 0.394 | 1.225 | 0.848 | 0.553 | yes | no | 0.674 | 0.828 | 0.468 |
The medoid of cluster one seems to be a smaller property than the medoid of cluster two, and the following plots confirm this. Cluster one is characterized by properties that could be called standard, whereas cluster two gathers bigger, exceptional ones. As a matter of fact, we see more extreme values in cluster two, both in the properties' characteristics and in their financial charges. The graphical representation of the categorical variables by cluster does not give us much more information, but our hypothesis of “exceptional real estate” in cluster two is once again confirmed, as there are fewer such properties. We can highlight that São Paulo is the city where the difference between the number of properties in each cluster is smallest, which can be explained by the fact that São Paulo has been a dynamic financial center of Brazil for many years.
The PCA method is not really appropriate for categorical variables, since it is a rotation of the data from one coordinate system to another; numerical data is preferable. We first used our data with the categorical variables turned into dummies, but that added six dummies over fourteen variables. Dummies carry little variance, and adding six of them is not suitable for a method which aims to capture as much variance as possible in as few components as possible. To overcome this, we performed PCA exclusively on the numerical variables of our data set.
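A PCA on standardized numeric columns can be sketched as follows, using toy data with four highly correlated columns:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four columns sharing one latent factor, as correlated rental features might.
rng = np.random.default_rng(5)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(4)])

# Standardise first so the rotation is not driven by the columns' scales.
pca = PCA().fit(StandardScaler().fit_transform(X))

# With highly correlated columns, the first component captures most variance.
print(pca.explained_variance_ratio_[0] > 0.9)  # True
```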
In the following plot, we can see that almost all the variables are positively correlated with dimension one. The exceptions are:
- floor, which is strongly correlated with dimension two and not at all with the first one;
- hoa, which is correlated with both of the first two dimensions.
The second plot shows that property_tax is negatively correlated with dimension three, while the other variables are barely correlated with it.
In this plot we can observe the contribution of each variable to the different dimensions. It is interesting to see that our previous statements hold here as well; the only exception is the contribution of floor to the third dimension, which reaches about 20%. We could add that every variable contributes above the ~12% threshold in at least one dimension. In fact, we know that the variance is conserved in Principal Component Analysis.
To conclude our unsupervised analysis, we plot the observations on the selected dimensions, colored by cluster. In the first graph, the clusters are quite clearly separated along dimension 2. Our previous statements are confirmed here again: the properties from cluster one are agglomerated, whereas the cluster two properties are more dispersed. The outliers, which could be considered exceptional properties, all belong to cluster two.
An important lesson to draw from this project is that the quality of a data set influences the predictive quality of supervised learning models. Although a wide range of models was tested on this specific data set, the predictive quality is only average.
This limitation is due to the presence of outliers in the data set and to the varying relevance of the more or less correlated variables. There are also questions about the levels of the city variable: as these are megacities, rent probably differs across neighborhoods, and taking this into account would have improved precision. Furthermore, the following variables could have provided interesting information for rent prediction: the age of the property, the type of property, the crime rate, and the distance to the city center.
In the unsupervised learning section, the two methods were of limited help in understanding the data. PCA, in particular, turned out to be less appropriate than PAM: transforming categorical variables into dummies is inconvenient when they are numerous, and considering only the numerical variables loses a lot of explained variance.
In conclusion, the recommended model is the supervised linear regression. It is the model with the smallest errors [RMSE & MAE] and the most robust accuracy. It is therefore not surprising that this type of model is widely used by estate agencies.